Data on abalone (Haliotis spp.) were collected as part of a research study.
Abalone (via Spanish abulón) is a common name for any of a group of small to very large sea snails, marine gastropod molluscs in the family Haliotidae. The shells of abalones have a low, open spiral structure, and are characterized by several open respiratory pores in a row near the shell’s outer edge. The thick inner layer of the shell is composed of nacre (mother-of-pearl), which in many species is highly iridescent, giving rise to a range of strong, changeable colors, which make the shells attractive to humans as decorative objects, jewelry, and as a source of colorful mother-of-pearl. The flesh of abalones is widely considered to be a desirable food, and is consumed raw or cooked by a variety of cultures.
Live Abalone by Tyler Bell | Flickr | https://flic.kr/p/ahw9aM
Inside of the Abalone Shell by Rojer | Flickr | https://flic.kr/p/7WkcDe
Abalone Fountain 2 by Dave Herrmann | Flickr | https://flic.kr/p/duVk5T
#load the abalone data set
#https://rpubs.com/AlistairGJ/Abalone
data(abalone)
#check the data
head(abalone)
#look at the structure of the data
str(abalone)
## 'data.frame': 4177 obs. of 9 variables:
## $ Type : Factor w/ 3 levels "F","I","M": 3 3 1 3 2 2 1 1 3 1 ...
## $ LongestShell : num 0.455 0.35 0.53 0.44 0.33 0.425 0.53 0.545 0.475 0.55 ...
## $ Diameter : num 0.365 0.265 0.42 0.365 0.255 0.3 0.415 0.425 0.37 0.44 ...
## $ Height : num 0.095 0.09 0.135 0.125 0.08 0.095 0.15 0.125 0.125 0.15 ...
## $ WholeWeight : num 0.514 0.226 0.677 0.516 0.205 ...
## $ ShuckedWeight: num 0.2245 0.0995 0.2565 0.2155 0.0895 ...
## $ VisceraWeight: num 0.101 0.0485 0.1415 0.114 0.0395 ...
## $ ShellWeight : num 0.15 0.07 0.21 0.155 0.055 0.12 0.33 0.26 0.165 0.32 ...
## $ Rings : int 15 7 9 10 7 8 20 16 9 19 ...
#store data as a tibble (shorter name also)
#this is more for personal preference
df<-as.tibble(abalone)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
#check the data/result
head(df)
#filter by female abalone and drop the Type column
#df<-filter(df, Type %in% c("F"))
#we don't need this anymore because all data represent female abalone
#df<-df[,-1]
We can quickly investigate correlations with GGally. From the pairs plot, we notice:
some of the relationships appear to be strongly non-linear.
some relationships are linear
Male (blue) and Female (red) are distributed similarly. Infants (green) are distributed differently vs adults.
Some correlations are very high (e.g. LongestShell and Diameter ~ 0.987)
Here is a ‘pairs plot’ using the ggpairs() function.
#explore variable relationships
#this will probably take some time to calculate
#skip Rings since we just care about Age now
#ggpairs(data=abalone, columns=c(2:8,10), title="abalone data")
ggpairs(df,
aes(colour = Type, alpha = 0.8),
title="Pairs plot") +
theme_minimal(base_size = 12) #font size
Here is another ‘pairs plot’ using a subset of variables.
df_subset<-df[,c(1,2,3,4,6)] #subset with specific columns
ggpairs(df_subset,
aes(colour = Type, alpha = 0.8),
title="Pairs plot") +
theme_minimal(base_size = 16) #font size
Note: “shucked” translates as “descascarado” (e.g. abulón descascarado)
For example, it may seem reasonable that the variable “ShuckedWeight” seems to relate to the variable “Diameter”. We can therefore create a null hypothesis that states the linear relationship of the two variables. We can investigate the strength of this relationship and determine its significance.
\(H_0: Shucked Weight = Intercept + Slope(Diameter) + Error\)
Use lm() to create a linear model. Note the Y~X format used to indicate which variable is predicting which other variable.
#create the linear model
fit<-lm(df$ShuckedWeight~df$Diameter, data=df)
The summary indicates a decent \(R^2\) value and a significant p-value. From the numbers, Diameter is a decent predictor of Shucked Weight. Is this acceptable?
summary(fit)
##
## Call:
## lm(formula = df$ShuckedWeight ~ df$Diameter, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23847 -0.06604 -0.01760 0.04500 0.68491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.455447 0.006535 -69.69 <2e-16 ***
## df$Diameter 1.997675 0.015568 128.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09984 on 4175 degrees of freedom
## Multiple R-squared: 0.7977, Adjusted R-squared: 0.7977
## F-statistic: 1.647e+04 on 1 and 4175 DF, p-value: < 2.2e-16
It’s a good idea to create a plot even if it is very simple. We will let X = Diameter (predictor) and Y = ShuckedWeight (predicted). We can see that maybe it is not such a good idea to assume this is a linear relationship. We will need to do something about this. What can we do?
plot(df$Diameter,df$ShuckedWeight,xlab="Diameter (mm)", ylab="Shucked Weight (g)")
###Follow-up Plot We can re-plot the data with the line of best fit suggested from the linear model. These data appear to be related in a non-linear way so we can’t rely on a simple linear regression (yet).
plot(df$Diameter,df$ShuckedWeight,xlab="Diameter (mm)", ylab="Shucked Weight (g)")
abline(fit, col="#008EED", lwd=5)
Nonlinear relationships are very common and one simple solution is to log-transform the data. This fundamentally changes how the results must be interpreted however. You might want to transform one or more variables in your model.
Only independent/predictor variable(s) (X) is/are log-transformed: Interpret the (coefficient/100) as the units of change in Y when X changes by one percent. (e.g. Coefficient = 53; “When X increases by 1%, Y changes by 0.53 units.”)
Only the dependent/response variable (Y) is log-transformed: Interpret the coefficient as the percent change in Y when X changes by one unit. (e.g. Coefficient = 1.9; “When X increases by 1 unit, Y changes by 1.9%.”)
Both dependent/response variable and independent/predictor variable(s) are log-transformed: Interpret the coefficient as the percent change in Y for every 1% increase in X. (e.g. Coefficient = 3.7; “When X increases by 1%, Y changes by 3.7%.”)
We can take the log of two variables and add the resulting columns to the data frame. Then we can plot the untransformed and transformed variables to see the effects. Notice how log transformations linearize the data.
#log transformation of ShuckedWeight
df$lsw<-log(df$ShuckedWeight)
#log transformation of Diameter
df$ld<-log(df$Diameter)
#create a 2x2 grid of plots
par(mfrow=c(2,2))
plot(df$Diameter, df$ShuckedWeight,xlab="", ylab="")
mtext("Diameter (mm)", side=1, line=3, col="black", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
plot(df$ld, df$ShuckedWeight,xlab="", ylab="")
mtext("log (Diameter (mm))", side=1, line=3, col="red", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
plot(df$Diameter, df$lsw,xlab="", ylab="")
mtext("Diameter (mm)", side=1, line=3, col="black", cex=1.5)
mtext("log (Shucked Weight (g))", side=2, line=2, col="red", cex=1.5)
plot(df$ld, df$lsw,xlab="", ylab="")
mtext("log (Diameter (mm))", side=1, line=3, col="red", cex=1.5)
mtext("log (Shucked Weight (g))", side=2, line=2, col="red", cex=1.5)
Another option here is the cube root. In this case, we can transform the dependent variable and achieve similar results to the log transformation. The data are more normally distributed with this transformation.
#cube root transformation on the response variable
#Interpret as: a one unit increase in the cube root of X is related to Y.
#An increase in diameter from 2 to 8 would be associated with the same change in satisfaction as an increase from 8 to 27 or 27 to 64.
plot((df$Diameter),(df$ShuckedWeight^(1/3)))
hist(df$ShuckedWeight^(1/3))
Use lm() to create a linear model. Note the Y~X format used to indicate which variable is predicting which other variable. We can fit both tranformed models to compare.
#summarize the linear model using log-transformed data
fit_log<-lm(lsw~ld,data=df)
summary(fit_log)
##
## Call:
## lm(formula = lsw ~ ld, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.76395 -0.13156 -0.01231 0.11617 2.45553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.38801 0.01093 127.0 <2e-16 ***
## ld 2.86907 0.01117 256.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2101 on 4175 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9404
## F-statistic: 6.593e+04 on 1 and 4175 DF, p-value: < 2.2e-16
#create the linear model using cubic root (cr)
fit_cr<-lm(ShuckedWeight^(1/3)~Diameter, data=df)
summary(fit_cr)
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14820 -0.02860 -0.00414 0.02412 0.43964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.025056 0.003028 8.276 <2e-16 ***
## Diameter 1.591929 0.007213 220.716 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared: 0.9211, Adjusted R-squared: 0.921
## F-statistic: 4.872e+04 on 1 and 4175 DF, p-value: < 2.2e-16
We can check some basic assumptions about the residuals using some functions from library(olsrr):
The error has a normal distribution (normality assumption).
The errors have mean zero.
The errors have same but unknown variance (homoscedasticity assumption).
The error are independent of each other (independent errors assumption).
NOTE: The following plots use the cubic root model (fit_cr).
#The residuals spread randomly around the 0 line indicating that the relationship is linear.
#The residuals form an approximate horizontal band around the 0 line indicating homogeneity of error variance.
#No one residual is visibly away from the random pattern of the residuals indicating that there are no outliers.
#look for a straight line
ols_plot_resid_qq(fit_cr)
# evenly scattered points around y = 0
ols_plot_resid_fit(fit_cr)
# approximately normal distribution
ols_plot_resid_hist(fit_cr)
How did the \(R^2\) value change following the transformation? Was this an improvement? Our output equation can be created from the summary information.
\(Shucked Weight = InterceptValue + (CoefficientValue * Diameter) + \varepsilon\)
Note: \(\varepsilon \sim N(0, Residual Standard Error^2)\).
#print the model, intercept, and coefficient
fit_cr
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
##
## Coefficients:
## (Intercept) Diameter
## 0.02506 1.59193
#view only the intercept and slope/coefficient
coef(fit_cr)
## (Intercept) Diameter
## 0.02505567 1.59192948
#store the intercept and slope for the linear model
#intercept <- coef(fit_cr)[1]
#slope <- coef(fit_cr)[2]
#print the summary
summary(fit_cr)
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14820 -0.02860 -0.00414 0.02412 0.43964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.025056 0.003028 8.276 <2e-16 ***
## Diameter 1.591929 0.007213 220.716 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared: 0.9211, Adjusted R-squared: 0.921
## F-statistic: 4.872e+04 on 1 and 4175 DF, p-value: < 2.2e-16
We can re-plot the data with the line of best fit suggested from the linear model. This relationship is the same as before, but through transformation we can interpret it more accurately: A one unit increase in X results in a cube root unit increase of Y.
plot(df$Diameter,df$ShuckedWeight^(1/3),xlab="", ylab="", pch=19, col="#00000011")
mtext("Diameter (mm)", side=1, line=3, col="black", cex=1.5)
mtext("Cube Root Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
abline(fit_cr, col="#008EED", lwd=4)
#curve(0.02506+(1.59193*x), add = TRUE, col="red", lwd=1) #same as abline output line
text(0.1, 1.1, expression(Y == 0.02506+1.59193*x), cex = 1.5, col="#008EED", adj=0)
We can plot the same information in the original data set (back-transforming). Note the equation in the plot.
plot(df$Diameter,df$ShuckedWeight,xlab="", ylab="", pch=19, col="#00000011")
mtext("Diameter (mm)", side=1, line=3, col="black", cex=1.5)
mtext("Shucked Weight (g)", side=2, line=2, col="black", cex=1.5)
curve(0.02506+(1.59193*x)^3, add = TRUE, col="#008EED", lwd=4)
text(0.1, 1.4, expression(Y == 0.02506+(1.59193*x)^3), cex = 1.5, col="#008EED", adj=0)
Multiple regression models include more than one predictor for a single predicted variable. Multivariate regression involves multiple independent and multiple dependent variables. For our example, we will only investigate multiple regression.
Simple Linear Regression
\(Y = b_0 + b_1X\)
Multiple Linear Regression
\(Y = b_0 + b_1X_1 + b_2X_2 + b_3X_3\)
Guiding Questions:
How do we create a linear model that includes multiple predictors?
How do we decide which variables are useful?
Guiding Questions:
How do we compare models?
Are more variables helpful?
Guiding Questions:
How do we select the ‘best’ model?
How do we defend our decision?
We can add variables to our original model and test them to see if the addition was useful.
Fit1 is a model containing the original cube-root transformation of Shucked Weight and Diameter.
fit1 <- lm(ShuckedWeight^(1/3)~Diameter,data=df)
summary(fit1)
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.14820 -0.02860 -0.00414 0.02412 0.43964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.025056 0.003028 8.276 <2e-16 ***
## Diameter 1.591929 0.007213 220.716 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04625 on 4175 degrees of freedom
## Multiple R-squared: 0.9211, Adjusted R-squared: 0.921
## F-statistic: 4.872e+04 on 1 and 4175 DF, p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit1)$r.squared #or use adj.r.squared
## [1] 0.9210632
Fit2 contains Fit1 + LongestShell
fit2 <- lm(ShuckedWeight^(1/3)~Diameter+LongestShell,data=df)
summary(fit2)
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter + LongestShell, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13519 -0.02658 -0.00367 0.02234 0.44587
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009158 0.003057 -2.996 0.00275 **
## Diameter 0.485106 0.041042 11.820 < 2e-16 ***
## LongestShell 0.926857 0.033915 27.329 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04261 on 4174 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.933
## F-statistic: 2.908e+04 on 2 and 4174 DF, p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit2)$r.squared #or use adj.r.squared
## [1] 0.9330438
Fit3 contains Fit2+ Height
fit3 <- lm(ShuckedWeight^(1/3)~Diameter+LongestShell+Height,data=df)
summary(fit3)
##
## Call:
## lm(formula = ShuckedWeight^(1/3) ~ Diameter + LongestShell +
## Height, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13718 -0.02603 -0.00356 0.02268 0.44535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.007696 0.003037 -2.534 0.0113 *
## Diameter 0.420354 0.041445 10.142 <2e-16 ***
## LongestShell 0.911585 0.033691 27.058 <2e-16 ***
## Height 0.236187 0.028354 8.330 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04226 on 4173 degrees of freedom
## Multiple R-squared: 0.9341, Adjusted R-squared: 0.9341
## F-statistic: 1.973e+04 on 3 and 4173 DF, p-value: < 2.2e-16
#extract only the r-squared value if needed
summary(fit3)$r.squared #or use adj.r.squared
## [1] 0.9341389
Compare models with ANOVA
anova(fit1, fit2)
anova(fit2, fit3)
The models can be neatly summarized using stargazer().
#summarize models with stargazer
#Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
#R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library(stargazer)
stargazer(fit1, fit2, fit3,type="text",no.space=TRUE,omit.stat=c("f"))
##
## =========================================================================
## Dependent variable:
## -----------------------------------------------------
## ShuckedWeight
## (1) (2) (3)
## -------------------------------------------------------------------------
## Diameter 1.592*** 0.485*** 0.420***
## (0.007) (0.041) (0.041)
## LongestShell 0.927*** 0.912***
## (0.034) (0.034)
## Height 0.236***
## (0.028)
## Constant 0.025*** -0.009*** -0.008**
## (0.003) (0.003) (0.003)
## -------------------------------------------------------------------------
## Observations 4,177 4,177 4,177
## R2 0.921 0.933 0.934
## Adjusted R2 0.921 0.933 0.934
## Residual Std. Error 0.046 (df = 4175) 0.043 (df = 4174) 0.042 (df = 4173)
## =========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01